(57) According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously - text and Facial Animation Parameters. In this architecture, text input is sent to a Text-To-Speech converter at a decoder that drives the mouth shapes of the face. Facial Animation Parameters are sent from an encoder to the face over the communication channel. The present invention includes codes (known as bookmarks) in the text string transmitted to the Text-to-Speech converter, which bookmarks are placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. In addition, the Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp using the encoder time stamp of the bookmark as a reference.
Description

BACKGROUND OF THE INVENTION

[0001] The present invention relates generally to methods and systems for coding of images, and more particularly to a method and system for coding images of facial animation.

[0002] According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously - text, and Facial Animation Parameters (FAPs). In this architecture, text input is sent to a Text-To-Speech (TTS) converter at a decoder that drives the mouth shapes of the face. FAPs are sent from an encoder to the face over the communication channel. Currently, the Verification Model (VM) assumes that synchronization between the input side and the FAP input stream is obtained by means of timing injected at the transmitter side. However, the transmitter does not know the timing of the decoder TTS. Hence, the encoder cannot specify the alignment between synthesized words and the facial animation. Furthermore, timing varies between different TTS systems. Thus, there currently is no method of aligning facial mimics (e.g., smiles, and expressions) with speech.

[0003] The present invention is therefore directed to the problem of developing a system and method for coding images for facial animation that enables alignment of facial mimics with speech generated at the decoder.

SUMMARY OF THE INVENTION



[0004] The present invention solves this problem by including codes (known as bookmarks) in the text string transmitted to the Text-to-Speech (TTS) converter, which bookmarks can be placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp (ETS). Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time, and should be interpreted as a counter. In addition, according to the present invention, the Facial Animation Parameter (FAP) stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp (RTS) derived from the timing of its TTS converter to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp using the encoder time stamp of the bookmark as a reference. In order to prevent conflicts between the encoder time stamps and the real-time time stamps, the encoder time stamps have to be chosen such that a wide range of decoders can operate.

[0005] Therefore, in accordance with the present invention, a method for encoding a facial animation including at least one facial mimic and speech in the form of a text stream, comprises the steps of assigning a predetermined code to the at least one facial mimic, and placing the predetermined code within the text stream, wherein said code indicates a presence of a particular facial mimic. The predetermined code is a unique escape sequence that does not interfere with the normal operation of a text-to-speech synthesizer.
[0006] One possible embodiment of this method uses the predetermined code as a pointer to a stream of facial mimics thereby indicating a synchronization relationship between the text stream and the facial mimic stream.
[0007] One possible implementation of the predetermined code is an escape sequence, followed by a plurality of bits, which define one of a set of facial mimics. In this case, the predetermined code can be placed in between words in the text stream, or in between letters in the text stream.
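
By way of illustration only, the following Python sketch inserts codes at arbitrary character offsets, so that a code can land in between words or in between letters. The `\!M{...}` wrapper follows the bookmark syntax given in the detailed description; the decimal rendering of the 16-bit value and the helper name `insert_bookmarks` are assumptions made for this example and are not part of the disclosure.

```python
def insert_bookmarks(text, marks):
    """Insert bookmark codes into text.

    marks is a list of (offset, code) pairs. offset is a character index
    into the original text; code is an integer that must fit in 16 bits.
    The "\\!M{<decimal>}" rendering is an assumed, illustrative encoding.
    """
    pieces = []
    prev = 0
    for offset, code in sorted(marks):
        if not 0 <= offset <= len(text):
            raise ValueError("offset outside text")
        if not 0 <= code <= 0xFFFF:
            raise ValueError("code must fit in 16 bits")
        pieces.append(text[prev:offset])   # plain text before the code
        pieces.append("\\!M{%d}" % code)   # escape sequence plus code
        prev = offset
    pieces.append(text[prev:])             # remaining plain text
    return "".join(pieces)


# One code between two words (offset 6) and one inside a word (offset 8).
marked = insert_bookmarks("Hello world", [(6, 1), (8, 2)])
print(marked)
```

Offsets refer to the original, unmarked text, so codes can be listed in any order without affecting each other's position.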

[0008] Another method according to the present invention for encoding a facial animation includes the steps of creating a text stream, creating a facial mimic stream, inserting a plurality of pointers in the text stream pointing to a corresponding plurality of facial mimics in the facial mimic stream, wherein said plurality of pointers establish a synchronization relationship with said text and said facial mimics.

[0009] According to the present invention, a method for decoding a facial animation including speech and at least one facial mimic includes the steps of monitoring a text stream for a set of predetermined codes corresponding to a set of facial mimics, and sending a signal to a visual decoder to start a particular facial mimic upon detecting the presence of one of the set of predetermined codes.

[0010] According to the present invention, an apparatus for decoding an encoded animation includes a demultiplexer receiving the encoded animation, outputting a text stream and a facial animation parameter stream, wherein said text stream includes a plurality of codes indicating a synchronization relationship with a plurality of mimics in the facial animation parameter stream and the text in the text stream; a text to speech converter coupled to the demultiplexer, converting the text stream to speech, outputting a plurality of phonemes, and a plurality of real-time time stamps and the plurality of codes in a one-to-one correspondence, whereby the plurality of real-time time stamps and the plurality of codes indicate a synchronization relationship between the plurality of mimics and the plurality of phonemes; and a phoneme to video converter being coupled to the text to speech converter, synchronizing a plurality of facial mimics with the plurality of phonemes based on the plurality of real-time time stamps and the plurality of codes.
[0011] In the above apparatus, it is particularly advantageous if the phoneme to video converter includes a facial animator creating a wireframe image based on the synchronized plurality of phonemes and the plurality of facial mimics, and a visual decoder being coupled to the demultiplexer and the facial animator, and rendering the video image based on the wireframe image.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 depicts the environment in which the present invention will be applied.
[0013] FIG. 2 depicts the architecture of an MPEG-4 decoder using text-to-speech conversion.

DETAILED DESCRIPTION 

[0014] According to the present invention, the synchronization of the decoder system can be achieved by using local synchronization by means of event buffers at the input of FA/AP/MP and the audio decoder. Alternatively, a global synchronization control can be implemented.

[0015] A maximum drift of 80 msec between the encoder time stamp (ETS) in the text and the ETS in the Facial Animation Parameter (FAP) stream is tolerable.
[0016] One embodiment for the syntax of the bookmarks when placed in the text stream consists of an escape signal followed by the bookmark content, e.g., \!M{bookmark content}. The bookmark content carries a 16-bit integer time stamp ETS and additional information. The same ETS is added to the corresponding FAP stream to enable synchronization. The class of Facial Animation Parameters is extended to carry the optional ETS.
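
As a non-authoritative sketch of how a decoder might monitor the text stream for such bookmarks, the Python example below separates the plain text from the bookmarks and records where each bookmark fell. The disclosure fixes only the escape signal and the presence of a 16-bit ETS plus additional information; the decimal ETS, the comma separating it from the additional information, and the name `parse_bookmarks` are assumptions made for this example.

```python
import re

# Assumed textual layout: \!M{<decimal ETS>[,<additional information>]}
BOOKMARK = re.compile(r"\\!M\{(\d+)(?:,([^}]*))?\}")


def parse_bookmarks(stream):
    """Return (plain_text, bookmarks) for a bookmarked text stream.

    Each bookmark is a tuple (position, ets, extra), where position is the
    character index in the plain text at which the bookmark occurred.
    """
    plain = []
    bookmarks = []
    pos = 0    # number of plain-text characters emitted so far
    last = 0   # end of the previous match in the raw stream
    for match in BOOKMARK.finditer(stream):
        chunk = stream[last:match.start()]
        plain.append(chunk)
        pos += len(chunk)
        ets = int(match.group(1))
        if ets > 0xFFFF:
            raise ValueError("ETS must fit in 16 bits")
        bookmarks.append((pos, ets, match.group(2) or ""))
        last = match.end()
    plain.append(stream[last:])
    return "".join(plain), bookmarks


text, marks = parse_bookmarks("Hello \\!M{1,smile}wo\\!M{2}rld")
print(text)
print(marks)
```

The plain text is what a text-to-speech converter would synthesize; the recorded positions let it report the moment at which each bookmark is reached.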

[0017] If an absolute clock reference (PCR) is provided, a drift compensation scheme can be implemented. Please note, there is no master slave notion between the FAP stream and the text. This is because the decoder might decide to vary the speed of the text as well as a variation of facial animation might become necessary, if an avatar reacts to visual events happening in its environment.

[0018] For example, suppose avatar 1 is talking to the user and a new avatar enters the room. A natural reaction of avatar 1 is to look at avatar 2, smile, and, while doing so, slow down the speed of the spoken text.

Autonomous Animation Driven Mostly by Text 



[0019] In the case of facial animation driven by text, the additional animation of the face is mostly restricted to events that do not have to be animated at a rate of 30 frames per second. Especially high-level action units like smile should be defined at a much lower rate. Furthermore, the decoder can do the interpolation between different action units without tight control from the receiver.

[0020] The present invention includes action units to be animated and their intensity in the additional information of the bookmarks. The decoder is required to interpolate between the action units and their intensities between consecutive bookmarks.
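
The interpolation step can be sketched as follows. The disclosure does not specify an interpolation rule, so linear interpolation is an assumption here, as are the function name `interpolate_intensity` and the representation of each bookmark as a (time, intensity) pair for a single action unit.

```python
def interpolate_intensity(keyframes, t):
    """Linearly interpolate an action-unit intensity at time t.

    keyframes is a list of (time, intensity) pairs sorted by time, one per
    bookmark. Outside the covered range the nearest intensity is held.
    """
    if not keyframes:
        raise ValueError("no keyframes")
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            if t1 == t0:           # two bookmarks at the same instant
                return v1
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return keyframes[-1][1]


# Hypothetical "smile" action unit: off, fully on at 1.0 s, off at 3.0 s.
smile = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
print(interpolate_intensity(smile, 0.5))
print(interpolate_intensity(smile, 2.0))
```

Because only the bookmarks carry intensities, the frame rate at which the decoder samples this function is left entirely to the decoder.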



[0021] [...] animations using simple tools, such as text editors, and significant savings in bandwidth.
[0022] FIG. 1 depicts the environment in which the present invention is to be used. The animation is created and coded in the encoder section 1. The encoded animation is then sent through a communication channel (or storage) to a remote destination. At the remote destination, the animation is recreated by the decoder 2. At this stage, the decoder 2 must synchronize the facial animations with the speech of the avatar using only information encoded with the original animation.
[0023] FIG. 2 depicts the MPEG-4 architecture of the decoder, which has been modified to operate according to the present invention. The signal from the encoder 1 (not shown) enters the Demultiplexer (DMUX) 3 via the transmission channel (or storage, which can also be modeled as a channel). The DMUX 3 separates out the text and the video data, as well as the control and auxiliary information. The FAP stream, which includes the Encoder Time Stamp (ETS), is also output by the DMUX 3 directly to the FA/AP/MP 4, which is coupled to the Text-to-Speech Converter (TTS) 5, a Phoneme FAP converter 6, a compositor 7 and a visual decoder 8. A Lip Shape Analyzer 9 is coupled to the visual decoder 8 and the TTS 5. User input enters via the compositor 7 and is output to the TTS 5 and the FA/AP/MP 4. These events include start, stop, etc.

[0024] The TTS 5 reads the bookmarks, and outputs the phonemes along with the ETS as well as with a Real-time Time Stamp (RTS) to the Phoneme FAP Converter 6. The phonemes are used to put the vertices of the wireframe in the correct places. At this point the image is not rendered.

[0025] This data is then output to the visual decoder 8, which renders the image, and outputs the image in video form to the compositor 7. It is in this stage that the FAPs are aligned with the phonemes by synchronizing the phonemes with the same ETS/RTS combination with the corresponding FAP with the matching ETS.
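
The ETS matching described in this paragraph can be sketched in Python as a lookup from ETS to RTS. This is a minimal illustration under stated assumptions: the (ETS, RTS) and (ETS, FAP) pair representations, the FAP names, and the function name `align_faps` are hypothetical and do not come from the disclosure.

```python
def align_faps(tts_events, faps):
    """Attach a real-time time stamp to each FAP through its ETS.

    tts_events: list of (ets, rts) pairs reported by the TTS, one per bookmark.
    faps: list of (ets, fap) pairs taken from the FAP stream.
    Returns (rts, fap) pairs sorted by rts. FAPs whose ETS matches no
    bookmark are skipped, since no real-time position is known for them.
    """
    rts_by_ets = dict(tts_events)
    scheduled = [(rts_by_ets[ets], fap) for ets, fap in faps if ets in rts_by_ets]
    scheduled.sort(key=lambda item: item[0])
    return scheduled


# The TTS reached bookmark 1 at 0.40 s and bookmark 2 at 0.95 s.
tts_events = [(1, 0.40), (2, 0.95)]
# FAP stream order is independent of the real-time order.
faps = [(2, "raise_brows"), (1, "smile"), (7, "blink")]
schedule = align_faps(tts_events, faps)
print(schedule)
```

The ETS is used only as a key, which is consistent with treating it as a counter rather than as real-world time.
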
[0026] The text input to the MPEG-4 hybrid text-to-speech (TTS) converter 5 is output as coded speech to an audio decoder 10. In this system, the audio decoder 10 outputs speech to the compositor 7, which acts as the interface to the video display (not shown) and the speakers (not shown), as well as to the user.

[0027] On the video side, video data output by the DMUX 3 is passed to the visual decoder 8, which creates the composite video signal based on the video data and the output from the FA/AP/MP 4.
[0028] There are two different embodiments of the present invention. In a first embodiment, the ETS placed in the text stream includes the facial animation. That is, the bookmark (escape sequence) is followed by a 16 bit codeword that represents the appropriate facial animation to be synchronized with the speech at this point in the animation.

[0029] Alternatively, the ETS placed in the text stream [...] animation in the FAP stream. Specifically, the escape sequence is followed by a 16 bit code that uniquely identifies a particular place in the FAP stream.
[0030] While the present invention has been described in terms of animation data, the animation data could be replaced with natural audio or video data. More specifically, the above description provides a method and system for aligning animation data with text-to-speech data. However, the same method and system applies if the text-to-speech data is replaced with audio or video. In fact, the alignment of the two data streams is independent of the underlying data, at least with regard to the TTS stream.



Claims 






1. A method for encoding a facial animation including at least one facial mimic and speech in the form of a text stream, comprising the steps of:

a) assigning a predetermined code to the at least one facial mimic; and
b) placing the predetermined code within the text stream, wherein said code indicates a presence of a particular facial mimic.

2. The method according to claim 1, wherein the predetermined code acts as a pointer to a stream of facial mimics thereby indicating a synchronization relationship between the text stream and the facial mimic stream.

3. The method according to claim 1, wherein the predetermined code comprises an escape sequence followed by a plurality of bits, which define one of a set of possible facial mimics.

4. The method according to claim 1, further comprising the step of placing the predetermined code in between words in the text stream.

5. The method according to claim 1, further comprising the step of placing the predetermined code in between letters in the text stream.

6. The method according to claim 1, further comprising the step of placing the predetermined code inside words in the text stream.



7. A method for encoding a facial animation comprising the steps of:

a) creating a data stream;

b) creating a facial mimic stream;

c) inserting a plurality of pointers in the data stream pointing to a corresponding plurality of facial mimics in the facial mimic stream, wherein said plurality of pointers establish a synchronization relationship with said data and said facial mimics.

8. The method according to claim 7, wherein each of the plurality of pointers comprises a time stamp.

9. The method according to claim 7, wherein the data stream comprises a text stream that is to be converted to speech in a decoding process.

10. The method according to claim 9, further comprising the step of placing at least one of the plurality of pointers in between words in the text stream.

11. The method according to claim 9, further comprising the step of placing at least one of the plurality of pointers in between syllables in the text stream.






12. The method according to claim 7, further comprising the step of placing at least one of the plurality of pointers inside words in the text stream.

13. The method according to claim 7, wherein the data stream comprises a video stream.

14. The method according to claim 7, wherein the data stream comprises an audio stream.

15. A method for decoding a facial animation including speech and at least one facial mimic, comprising the steps of:

a) monitoring a text stream for a set of predetermined codes corresponding to a set of facial mimics; and

b) sending a signal to a visual decoder to start a particular facial mimic upon detecting the presence of one of the set of predetermined codes.

16. The method according to claim 15, wherein the predetermined code acts as a pointer to a stream of facial mimics thereby indicating a synchronization relationship between the text stream and the facial mimic stream.






17. The method according to claim 15, wherein the predetermined code comprises an escape sequence.

18. The method according to claim 15, further comprising the step of placing the predetermined code in between words in the text stream.

19. The method according to claim 15, further comprising the step of placing the predetermined code in between phonemes in the text stream.
20. The method according to claim 15, further comprising the step of placing the predetermined code inside words in the text stream.

21. An apparatus for decoding an encoded animation comprising:

a) a demultiplexer receiving the encoded animation, outputting a text stream and a facial animation parameter stream, wherein said text stream includes a plurality of codes indicating a synchronization relationship with a plurality of mimics in the facial animation parameter stream and the text in the text stream;
b) a text to speech converter coupled to the demultiplexer, converting the text stream to speech, outputting a plurality of phonemes, and a plurality of real-time time stamps and the plurality of codes in a one-to-one correspondence, whereby the plurality of real-time time stamps and the plurality of codes indicate a synchronization relationship between the plurality of mimics and the plurality of phonemes; and
c) a phoneme to video converter being coupled to the text to speech converter, synchronizing a plurality of facial mimics with the plurality of phonemes based on the plurality of real-time time stamps and the plurality of codes.

22. The apparatus according to claim 21, further comprising a compositor converting the speech and video to a composite video signal.

23. The apparatus according to claim 21, wherein the phoneme to video converter includes:

a) a facial animator creating a wireframe image based on the synchronized plurality of phonemes and the plurality of facial mimics; and
b) a visual decoder being coupled to the demultiplexer and the facial animator, and rendering the video image based on the wireframe image.